Docs #30
Merged
Introduces APPLY_MIGRATIONS.md, a comprehensive runbook for safely applying schema migrations to production and staging environments. Includes checklists, commands, verification queries, troubleshooting steps, rollback procedures, and operational guidance for SREs and engineers.
Add runbook for applying database migrations
Introduces BACKFILL.md, a comprehensive runbook detailing procedures for safely backfilling historical hero_snapshots into normalized tables. Covers planning, execution, validation, monitoring, troubleshooting, rollback, and post-backfill tasks for SRE, engineering, and QA teams.
Add backfill runbook for historical snapshots
Introduces a detailed runbook for diagnosing and mitigating Postgres connection exhaustion incidents. Provides emergency steps, diagnostics, mitigation strategies, permanent remediation actions, and references for SREs and backend engineers.
Add Postgres connection exhaustion runbook
Introduces DB_RESTORE.md with detailed procedures for restoring the StarForge PostgreSQL database, including snapshot, logical dump, and point-in-time recovery workflows. Provides checklists, validation steps, troubleshooting, and communication templates for incident response.
Add database restore runbook documentation
Introduces a new runbook document for handling ETL failure spikes in the docs/OP_RUNBOOKS directory.
Introduces a comprehensive runbook for triaging, mitigating, and resolving sudden spikes in ETL worker failures for StarForge. Includes checklists, Prometheus queries, mitigation steps, common failure classes, communication guidelines, recovery procedures, and post-incident actions to support SREs and backend engineers during ETL incidents.
ETL failure spike
Introduces MIGRATION_ROLLBACK.md to document procedures for rolling back database migrations.
Introduces a comprehensive runbook for safely rolling back problematic database migrations. The guide covers triage, rollback strategies (down migration, restore from backup, app revert), verification steps, communication protocols, and post-incident actions to ensure data integrity and minimize downtime.
Migration rollback
Introduces a new runbook in the documentation that outlines procedures for handling secret compromise incidents.
Introduces a comprehensive runbook for handling suspected or confirmed secret and credential leaks. Covers immediate containment, rotation, forensics, investigation, recovery, communication, verification, and post-incident hardening steps for various secret types and providers.
Secret compromise
Introduces a new runbook file for handling worker out-of-memory (OOM) issues in the documentation.
Introduces WORKER_OOM.md, a comprehensive runbook for triaging, mitigating, and recovering from Out-Of-Memory incidents affecting ETL and background workers. The document covers immediate containment, diagnostic steps, Kubernetes commands, code and workflow mitigations, resource tuning, and post-incident actions to improve reliability and prevent future OOM events.
Removed file extension from the runbook title for consistency with other documentation headers.
Update WORKER_OOM runbook title formatting
Introduces a new runbook file for QUEUE_BACKLOG in the OP_RUNBOOKS directory. The file will document procedures and information for queue backlog operations.
Updated the top-level titles in several documentation files to remove the file extension suffixes for consistency and improved readability.
Introduces a comprehensive runbook for triaging, mitigating, and resolving queue backlogs in StarForge. Covers detection, immediate actions, root cause analysis, safe scaling, dead-letter queue handling, recovery, and long-term prevention for Redis/BullMQ and DB-backed queues.
Queue backlog
No description provided.